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Abstract 

Principal component analysis (PCA) algorithms use neural net- 
works to extract the eigenvectors of the correlation matrix from the 
data. However, if the process is non-Gaussian, PCA algorithms or 
their higher order generalisations provide only incomplete or mislead- 
ing information on the statistical properties of the data. To handle 
such situations we propose neural network algorithms, with an hybrid 
(supervised and unsupervised) learning scheme, which constructs the 
characteristic function of the probability distribution and the tran- 
sition functions of the stochastic process. Illustrative examples are 
presented, which include Cauchy and Levy-type processes. 
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1 Introduction 



Let Xi denote the output of node i in a neural network. Hebbian learning 

(Hebb 1949) is a type of unsupervised learning where the neural network 
connection strengths Wij are reinforced whenever the products large. 
If Q is the correlation matrix 

Qij — (•^i'^j) (-^) 

and the Hebbian learning law is local, all the lines of the connection ma- 
trix Wij will converge to the eigenvector of Q with the largest eigenvalue. 
To obtain other eigenvector directions requires non-local laws (Sanger 1989, 
Oja 1989, 1992, Dente and Vilela Mendes 1996). These principal component 
analysis (PCA) algorithms find the characteristic directions of the correlation 
matrix Q. If the data has zero mean ((xj) = 0) they are the orthogonal direc- 
tions along which the data has maximum variance. If the data is Gaussian in 
each channel, it is distributed as a hyperellipsoid and the correlation matrix 
Q already contains all information about the statistical properties. This is 
because higher order moments of the data may be obtained from the second 
order moments. However, if the data is non-Gaussian, the PCA analysis is 
not complete and higher order correlations are needed to characterise the sta- 
tistical properties. This led some authors (Softy and Kammen 1991, Taylor 
and Coombes 1993) to propose networks with higher order neurons to obtain 
the higher order statistical correlations of the data. An higher order neuron 
is one that is capable of accepting, in each of its input lines, data from two 
or more channels at once. There is then a set of adjustable strengths Wij , 
^ijih T ' ' ' ■> ^iji-jn J being the order of the neuron. Networks with higher 
order neurons have interesting applications, for example in fitting data to 
a high-dimensional hypersurface. However there is a basic weakness in the 
characterisation of the statistical properties of non-Gaussian data by higher 
order moments. Existence of the moments of a distribution function depends 
on the behaviour of this function at infinity and it frequently happens that a 
distribution has moments up to a certain order, but no higher ones. A well- 
behaved probability distribution might even have no moments of order higher 
than one (the mean). In addition a sequence of moments does not necessarily 
determine a probability distribution function uniquely (Lukacs 1970). Two 
different distributions may have the same set of moments. Therefore, for 
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non-Gaussian data, the PGA algorithms or higher order generahsations may 
lead to misleading results. 

As an example consider the two-dimensional signal shown in Fig.l. Fig. 
2 shows the evolution of the connection strengths Wu and W12 when this 
signal is passed through a typical PGA algorithm. Large oscillations appear 
and finally the algorithm ovcrfiows. Smaller learning rates do not introduce 
qualitative modifications in this evolution. The values may at times appear 
to stabihse, but large spikes do occur. The reason for this behaviour is that 
the seemingly harmless data in Fig.l is generated by a linear combination of 
a Gaussian with the following distribution 



which has first moment, but no moments of higher order. 

To be concerned with non-Gaussian processes is not a pure academic 
exercise because, in many applications, adequate tools are needed to analyse 
such processes. For example, processes without higher order moments, in 
particular those associated with Levy statistics, are prominent in complex 
processes such as relaxation in glassy materials, chaotic phase diff'usion in 
Josephson junctions and turbulent diffusion (Shlesinger et al 1993, Zumofen 
and Klaftcr 1993, 1994). 

Moments of an arbitrary probability distribution may not exist. However, 
because every bounded and measurable function is integrable with respect to 
any distribution, the existence of the characteristic function f{a) is always 
assured (Lukacs 1970). 



where a and x are N-dimensional vectors, x is the data vector and F{x) its 
distribution function. 

The characteristic function is a compact and complete characterisation of 
the probability distribution of the signal. If, in addition, one wishes to de- 
scribe the time correlations of the stochastic process x{t), the corresponding 
quantity is the characteristic functional (Hida 1980) 





(2) 




(3) 
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where ^ (t) is a smooth function and the scalar product is 



= / dtx (t)e(t) 



(4) 



where fi (x) is the probabihty measure over the sample paths of the process. 

In the following we develop an algorithm to compute the characteristic 
function from the data, by a learning process. The main idea is that in 
the end of the learning process we should have a neural network which is a 
representation of the characteristic function. This network is then available 
to provide all the required information on the probability distribution of the 
data being analysed. To obtain full information on the stochastic process, 
a similar algorithm might be used to construct the characteristic functional. 
However this turns out to be computationally very demanding. Instead we 
develop a network to learn the transition function and from this the process 
may be characterised. 

2 Learning the characteristic function 

Suppose we want to learn the characteristic function f{a) (Eq. ^ of a one- 
dimensional signal x(t) in a domain a G [aojCtiv] • The a-domain is divided 
into N intervals by a sequence of values ao ai <y2 ■ ■ ■ c^n and a network is 
constructed with N+1 intermediate layer nodes and an output node (Fig. 3). 

The learning parameters in the network are the connection strengths Woi 
and the node parameters 6i. The existence of the node parameter means that 
the output of a node in the intermediate layer is 9iXi (a), Xi being a non- 
linear function. The use of both connection strengths and node parameters in 
neural networks makes them equivalent to a wide range of other connectionist 
systems (Doyne Farmer 1990) and improves their performance in standard 
applications (Dente and Vilela Mendes 1996). The learning laws for the 
network of Fig. 3 are: 



7,7] > . The intermediate layer nodes are equipped with a radial basis 



9, (t + 1) = 9, (t) + 7 (cosa.x (t) - 9, (t)) 
Wo, {t + l) = Wo, (t) + 
V Ej [Oj (t) - Ek Wok (t) Xk (aj) 9k (t)] 9i (t) Xi («,) 



(5) 
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function 

g-(Q-Q,)2/2a2 

where in general we use ai = a for all i. The output is a simple additive 
node. 

The learning constant 7 should be sufficiently small to insure that the 
learning time is much smaller than the characteristic times of the data x{t). 
If this condition is satisfied each node parameter 6i tends to (cosajx), the 
real part of the characteristic function /(a) for a = a,. 

The Wqi learning law was chosen to minimise the error function 

/ (^) = ^ E - E ^okXk e,^ (7) 

One sees that the learning scheme is an hybrid one, in the sense that the node 
parameter 6i learns, in an unsupervised way, (the real part of) the character- 
istic function f{ai) and then, by a supervised learning scheme, the W^iS are 
adjusted to reproduce the 9i value in the output whenever the input is a^. 
Through the learning law (^) each node parameter 6i converges to (cos aix) 
and the interpolating nature of the radial basis functions guarantees that, af- 
ter training, the network will approximate the real part of the characteristic 
function for any a in the domain [ao, otv]- 

A similar network is constructed for the imaginary part of the character- 
istic function, where now 

(t + 1) = (t) + 7 (sin aiX (t) - 6^ (t)) (8) 

For higher dimensional data the scheme is similar. The number of re- 
quired nodes is A^'^ for a ci- dimensional data vector x (t). For example for the 
2-dimensional data of Fig.l we have used a set of A^^ nodes (Fig. 4) 

Each node in the square lattice has two inputs for the two components 
«! and a2 of the vector argument of / {'ct)- The learning laws before 



Xi («) 



(t + l) = e^ij) (t) + 7 (cosa^)X (t) - Q(ij) (t) 



0(ij) 
J2{mn) W^O(mn.) 



(t + 1) = Woi^,) (t) + 



(t) X{mn) («(fc«j) ^(mn) (t) 0{ij) (t) X{ 



ij) I 



(9) 
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The pair (ij) denotes the position of the node in the square lattice and the 
radial basis function is 



X{ij) (tt) 



(10) 




Two networks are used, one for the real part of the characteristic function, 



Figs.5a-b shows the values computed by our algorithm for the real and 
imaginary parts of the characteristic function corresponding to the two- 
dimensional signal in Fig.l. On the left is a plot of the exact characteristic 
function and on the right the values learned by the network. In this case 
we show only the mesh corresponding to the 6i values. One obtains a 2.0% 
accuracy for the real part and 4.5% accuracy for the imaginary part. 

The convergence of the learning process is fast and the approximation is 
reasonably good. Notice in particular the slope discontinuity at the origin 
which reveals the non-existence of a second moment. The parameters used 
for the learning laws in this example were 7=0.00036, ?7=1.8, cr=0.25. The 
number of points in the training set is 25000. 

For a second example the data was generated by a Weierstrass random 
walk with probability distribution 



and b=1.31, which is a process of the Levy flight type. The character- 
istic function, obtained by the network, is shown in Fig. 6. Taking the 
log(— log)of the network output one obtains the scaling exponent 1.49 near 
a=0, close to the expected fractal dimension of the random walk path (1.5). 
The parameters used for the learning laws in this example were 7=0.0005, 
?7=1.75, (7=0.1732. The number of points in the training set is 80000. 

These examples test the algorithm as a process identifier, in the sense 
that, after the learning process, the network is a dynamical representation of 
the characteristic function and may be used to perform all kinds of analysis 
of the statistics of the data. There are other ways to obtain the characteristic 
function of a probability distribution, which may be found in the statistical 



another for the imaginary part with, in Eqs.(^, cos a^x (t) replaced by 





j=0 



(11) 
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inference literature (Prakasa Rao 1987). Our purpose in developing neural- 
like algorithms for this purpose was both to have a device that, after learning, 
is quick to evaluate and, at the same time, adjusts itself easily, through 
continuous learning, to changing statistics. As the PGA algorithms that 
extract the full correlation matrix, our neural algorithm laws are also non- 
local. As a computation algorithm this is not a serious issue, but for hardware 
implementations it might raise some problems. 

3 Identification of stochastic processes 

As we have stated before the full characterisation of the time structure of a 
stochastic process requires the knowledge of its characteristic functional (Eq. 
1^) for a dense set of functions C,(t). 

To construct an approximation to the characteristic functional we might 
discretize the time and the inner product in the exponential becomes a sum 
over the process sampled at a sequence of times. 



The problem would then be reduced to the construction of a multidimensional 
characteristic function as in Section 2. In practice we would have to limit 
the time-depth of the functional to a maximum of T time steps, TAt being 
the maximum time-delay over which time correlations may be obtained. If 
N is the number of different x values for each k, the algorithm developed in 
Section 2 requires N'^ nodes in the intermediate layer and, for any reasonably 
large T, this method becomes computationally explosive. 

An alternative and computationally simpler method consist in, restricting 
ourselves to Markov processes, to characterise the process by the construc- 
tion of networks to represent the transition function for fixed time intervals. 
From these networks the differential Chapman-Kolmogorov equation may 
then be reconstructed. Let x{t) be a one dimension Markov process and 
p {x2, t + At\xi,t) its transition function, that is, the conditional probability 
of finding the value X2 at time t + At given xi at time t. Assume further 
that the process is stationary 




(12) 



p {X2, t + At\xi, t) = p (X2, At|xi, t) 



(13) 
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The network that configures itself to represent this function is similar to 
the one we used for the 2-dimensional characteristic function. It is sketched 
in Fig.s 7a-b. It has a set of A^^ intermediate layer nodes with node pa- 
rameters, the node with coordinates corresponding to the arguments 
{x2{ij) = Xq + iAx,Xi{ij) = Xq + jAx) in the transition function. The do- 
main of both arguments is (xq, Xq + NAx). For each pair {x2 = x{t + At), Xi = 
in the data set, the node parameters that are updated are those in the 4 
columns containing the nearest neighbours of the point af = {x2,xi) (Fig. 
7b). The learning rule is 



x{t)) 



5',(t)%)(t) + 



%)(t + l) 



»?{ij)("a?)exp^-| /a 



S,{t+1) 



s^it + i) = s,it) + j2 



X — X 



/a 



X 



X 



(ki) 



I a 



(14) 



(15) 



where r](^ij'){lf) = 1 if (ij) is one of the nearest neighbours of the data point 
and zero otherwise, a is a neighbourhood smoothing factor. Sj{t) is a column 
normalisation factor. In the limit of large learning times the node parameters 
approach the transition function 



e 



(ij) 



p {xq + iAx, At\xo + jAx) 



(16) 



As for the networks in Section 2, the intermediate layer nodes are equipped 
with a radial basis function (Eq. |10D and the connection strengths in the 
output additive node have a learning law identical to the second equation 
in d). The role of this part of the network is, as before, to obtain an 
interpolating effect. 

What the algorithm of Eqs. (|I1|) and (|T3|) does is to compute recursively 
the average number of transitions between points in the configuration space 
of the process. The spatial smoothing effect of the algorithm automatically 
insures a good representation of a continuous function from a finite data set. 
Furthermore its recursive nature would be appropriate for the case of drifting 
statistics. 

For a stationary process, once the learning process has been achieved and 
if At is chosen to be sufficiently small, the network itself may be used to 
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simulate the stationary Markov process. A complete characterisation of the 
process may also be obtained by training a few similar networks for different 
(small) At values and computing the coefficient functions in the differential 
Chapman- Kolmogorov equation (Gardiner 1983). 

dtp {r, t'\t,t) = -J:^i- t')p{r, t'\t, t)) 

+ s., ai- t')pi^^ t)) (17) 

+ / dlt ^W{'^\lt,t')p{lt,t'\lf,t) - W{lt\^,t')p{'z^,t'\lf,t)^ 

The coefficients are obtained from the transition probabilities, noticing that 
for all e > 

2im^p("af , t + At|^, t) = iy( af 1^, t) for | if — ^| > e (18) 



lim / , . dx(xi - Zi)p('^,t + At\^,t) = Ai('t,t) + 0(e) (19) 

At^O At J\ X - Z \<e 



At^o 'At ~^ dx{xi — Zi){xj — Zj)p{lf', t + At|^, t) = Bij{'^, t) + 0(e) 

(20) 

W(li^\^ ,t) is the jumping kernel, y4j(^,t) the drift and Biji^ ,t) the diffu- 
sion coefficient. 

As an example we have considered a Markov process with jumping, drift 
and diffusion. A typical sample path is shown in Fig. 8. Three networks 
were trained on this process, to learn the transition function for t = At, 2 At 
and 3 At {At = 0.374ms). Fig. 9 shows the transition function for t = At 
and 3At. Fig. 10 shows two sections of the transition function for X2 = 0, 
that is, p{x, At\0, 0) and p{x, 3At|0, 0). 

The networks were then used to extract the coefficient functions A{x,t), 
B{x,t) and W{x\z,t). To find the drift A{x,t) we use Eq. (|T|). Fig. 11 
shows the computed drift function and a least square linear fit. Also shown 
is a comparison with the exact drift function of the process. 

To obtain the diffusion coefficient B{x,t) we use (|3)- Fig. 12 shows the 
diffusion coefficient for different At values. At is the smallest time step used 
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in the process simulation. Therefore B{x,t) = 2.6 is our estimate for the 
diffusion coefficient. In this case, because the diffusion coefficient is found to 
be a constant, the value of the jumping kernel W{x\z,t) is easily obtained 
by integration around the local maxima Xm of p{x, At\z) with |x — 2;| > 0.2. 



with S = 0.2. We conclude W{x\z) = 300S{\x - z\ - 0.5). The parameters 
used for the learning laws in this example were ?7=0.48, q;=0. 00021. The 
number of points in the training set is 1000000. 
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Fig. 1 - A two-dimensional test signal 
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Fig. 2 - Evolution of the connection strengths Wn and W12 in a PC A 
network for the data in Fig. 1 




Fig. 3 Network to learn the characteristic function of a scalar process 




Fig. 5a - Real part of the characteristic function for the data in Fig.l (left) 
and the mesh of 0/ values (right) obtained by the network. 



Fig. 5b - Imaginary part of the characteristic function for the data in Fig.l 
(left) and the mesh of di values (right) obtained by the network. 




Fig. 6-a Characteristic function for the Weierstrass random walk (b=1.31) 
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Fig. 6-b log(-log) of the characteristic function f( a) for the Weiers trass 
random walk (b= 1 . 3 1 ) 
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Fig. 7 (a) Network that learns the transition function p{x2,^t\xi) of a 
stationary Markov process; (b) Nearest neighbours of the data point 
X = {x{t + At),x{t)). 
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Fig. 8 Typical sample path of a stationary Markov process. 




a) b) 
Fig. 9 Transition functions obtained for a) t=3AT and b) AT. 
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Fig. 10 Transition function p(x,AT\0,0) a) and p(x,3AT\0,0) b). 
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Fig. 12 Diffusion function for AT (o), 2AT (+) and SAT (*). 



